ci: pilot-migrate clippy job to smithy self-hosted runners #201

Open
avrabe wants to merge 11 commits into main from smithy-clippy-pilot

@avrabe avrabe commented May 3, 2026

Summary

First pilot migration of a CI job from GitHub-hosted to the
pulseengine self-hosted fleet (hetzner-private runner group on
pulseengine-ci-01). Scope deliberately small: just the clippy
job, switched to [self-hosted, linux, x64, rust-cpu]. Other jobs
(fmt, test, proofs) stay on ubuntu-latest.
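As a rough sketch of what the change amounts to in ci.yml (the step list and the clippy flags are assumptions for illustration, not copied from the actual workflow):

```yaml
jobs:
  clippy:
    # Previously: runs-on: ubuntu-latest
    runs-on: [self-hosted, linux, x64, rust-cpu]
    steps:
      - uses: actions/checkout@v4
      # Same nightly toolchain action as on hosted runners
      - uses: dtolnay/rust-toolchain@nightly
        with:
          components: clippy
      # Assumed invocation; -D warnings turns any clippy warning into a failure
      - run: cargo clippy --workspace --all-targets -- -D warnings
```

Only the `runs-on:` line differs from the hosted setup; a job is picked up by a runner that carries every label in the list.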

Rationale

  • Spar's recent CI runs show 400-600 min completion times, much of
    which is GitHub-hosted runner queueing on the org-free tier
    (20-concurrent cap).
  • Clippy is meaningful compile work (good sccache integration test)
    but bounded — failure doesn't block format checks or tests.
  • No sudo, apt, or container needed → no friction with our
    rootless runner setup.
  • Spar already pins nightly via dtolnay/rust-toolchain, so the
    toolchain version matches between hosted and self-hosted.

Test plan

  • CI run completes: clippy job lands on a rust-cpu runner (one of runners 5/6/7) within seconds, with no GitHub queue wait
  • Compile succeeds end-to-end with no clippy warnings
  • Other jobs (fmt, test) still run on ubuntu-latest as before
  • Second push to this branch should be much faster on clippy thanks to sccache hit

Rollback

Revert this commit. runs-on: flips back to ubuntu-latest and
the next run uses GitHub-hosted compute.

Follow-ups (if green)

  • Migrate fmt and test next (separate PRs).
  • Add a heavy-quality workflow (mutants-weekly.yml) that targets
    lean-mem runners, separate from gating CI.

avrabe added 6 commits May 3, 2026 07:54
Switches just the clippy job from ubuntu-latest to
[self-hosted, linux, x64, rust-cpu] — one of the three rust-cpu
runners on pulseengine-ci-01 (hetzner-private group).

Other jobs (fmt, test) stay on ubuntu-latest for now; once we have
a few green clippy runs and timing data, the rest can follow.

Why clippy first:
- meaningful compile work (good sccache test)
- bounded scope — failure doesn't block fmt or test
- no sudo, apt, or container needed
- spar already tracks nightly via dtolnay/rust-toolchain so the
  toolchain matches between hosted and self-hosted

If this PR's clippy job goes red on the self-hosted runner but
passes locally / on hosted, that's a smithy bug, not a code bug.

The previous clippy run on the self-hosted runner failed at
highs-sys build because cmake wasn't on the host. smithy main now
ships the common Rust build-dep set (cmake, clang, lld, perl, m4,
protobuf-compiler, libclang-dev, zlib1g-dev). Pushing an empty
commit to re-trigger CI; clippy should now finish on rust-cpu.

Builds on the proven clippy migration (PR description, original
commit on this branch). Two separate concerns:

1) ci.yml — broaden the migration

Migrate every gating job that doesn't need infra we don't have on
the smithy host. Two stay on ubuntu-latest with explicit comments
explaining why; everything else now targets the matching smithy
runner class:

  rust-cpu (12G MemoryHigh)        clippy, test, bench-smoke,
                                   coverage, proptest, fuzz-smoke,
                                   rivet-validate
  lean-mem (24G MemoryHigh)        miri, mutants
  light    (4G  MemoryHigh)        fmt, audit, deny, supply-chain
  ubuntu-latest (kept)             bazel-test (no Bazel on host),
                                   kani (kani-verifier bundles CBMC,
                                   ~100 MB install — not worth pre-
                                   provisioning until kani sees more
                                   use)

The lean-mem class for miri / mutants is deliberate: both are
RAM-aggressive (Miri's borrow tracker, mutants' parallel cargo
invocations). The 24G MemoryHigh ceiling on smithy lean-mem
runners is comfortably above the 12G rust-cpu cap.
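A sketch of how the class mapping above might look in ci.yml, one representative job per class. The label lists for lean-mem and light are assumed to follow the same shape as the rust-cpu one; steps are omitted because they do not change.

```yaml
jobs:
  test:
    runs-on: [self-hosted, linux, x64, rust-cpu]   # 12G MemoryHigh
    # steps: unchanged
  miri:
    runs-on: [self-hosted, linux, x64, lean-mem]   # 24G MemoryHigh (assumed label list)
    # steps: unchanged
  fmt:
    runs-on: [self-hosted, linux, x64, light]      # 4G MemoryHigh (assumed label list)
    # steps: unchanged
  bazel-test:
    # Stays hosted: no Bazel on the smithy host
    runs-on: ubuntu-latest
    # steps: unchanged
  kani:
    # Stays hosted: not worth pre-provisioning kani-verifier yet
    runs-on: ubuntu-latest
    # steps: unchanged
```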

2) mutants-weekly.yml — new heavy-quality workflow

Counterpart to the gating `mutants:` job in ci.yml. Different
operational pattern (smithy DD-pattern for "heavy quality"):

  - schedule: 02:00 UTC every Sunday + workflow_dispatch on demand
  - runs-on: lean-mem (24G), timeout-minutes: 720
  - concurrency.cancel-in-progress: false (never cancel a quality run)
  - workflow_dispatch inputs: `shard` (default 0/8 for sanity, "all"
    for the full ~hours pass) + `packages` (space-separated -p list)
  - results land in GITHUB_STEP_SUMMARY (markdown table of
    missed/caught/timeout/unviable) plus an uploaded artefact with
    90-day retention
  - no PR red lights; no auto-Issue filing yet (that's a follow-up
    once the report shape stabilises)
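
A minimal sketch of a workflow with this shape, not the file as committed. The concurrency group name, artefact name, lean-mem label list, and the choice to run the full pass on the schedule are assumptions; `--shard k/n` and the `mutants.out/*.txt` outcome files are standard cargo-mutants behaviour.

```yaml
name: mutants-weekly
on:
  schedule:
    - cron: "0 2 * * 0"          # 02:00 UTC every Sunday
  workflow_dispatch:
    inputs:
      shard:
        description: 'Shard as k/n, or "all" for the full pass'
        required: false
        default: "0/8"
      packages:
        description: "Space-separated -p list, e.g. -p crate-a -p crate-b"
        required: false
        default: ""
concurrency:
  group: mutants-weekly          # assumed group name
  cancel-in-progress: false      # never cancel a quality run
jobs:
  mutants:
    runs-on: [self-hosted, linux, x64, lean-mem]
    timeout-minutes: 720
    steps:
      - uses: actions/checkout@v4
      - name: Run cargo-mutants
        env:
          # inputs are empty on scheduled runs; assume those do the full pass
          SHARD: ${{ inputs.shard || 'all' }}
          PACKAGES: ${{ inputs.packages }}
        run: |
          args=()
          if [ "$SHARD" != "all" ]; then args+=(--shard "$SHARD"); fi
          # PACKAGES is left unquoted on purpose so "-p a -p b" splits into words
          cargo mutants "${args[@]}" $PACKAGES
      - name: Summarise outcomes
        if: always()
        run: |
          {
            echo "| outcome | count |"
            echo "| --- | --- |"
            for f in missed caught timeout unviable; do
              echo "| $f | $(wc -l < "mutants.out/$f.txt") |"
            done
          } >> "$GITHUB_STEP_SUMMARY"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: mutants-out      # assumed artefact name
          path: mutants.out
          retention-days: 90
```

The `if: always()` guards are there because cargo-mutants exits non-zero when mutants are missed, and the summary and artefact are most useful in exactly that case.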

This is the second-pattern pilot the smithy fleet was sized for —
the lean-mem runners have been idle since registration; this puts
them on the work they were labelled for.

GitHub limits workflow_dispatch and schedule triggers to workflows
that already exist on the default branch. Adding a path-filtered
push trigger lets us exercise the workflow on this PR before merge.
The push: block carries a TEMPORARY marker; remove it before merge.
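The temporary trigger could look roughly like this, assuming the workflow lives at the conventional `.github/workflows/` path:

```yaml
on:
  schedule:
    - cron: "0 2 * * 0"
  workflow_dispatch:
  # TEMPORARY: remove before merge. Lets the workflow run from this PR
  # branch, since schedule and workflow_dispatch only fire for workflows
  # that exist on the default branch.
  push:
    paths:
      - .github/workflows/mutants-weekly.yml
```

With the path filter, only pushes that touch the workflow file itself start a run, so ordinary code pushes to the branch do not kick off a multi-hour mutants pass.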
Prior run hit 'Permission denied (os error 13)' on .d files in
target/. Direct file-write tests as the runner user succeed; the
files are owned correctly with mode 640. Suspect: stale state
left by a cancelled run interacting badly with concurrent jobs
landing on the same runner via cache restoration. Clearing all
runner _work and the shared sccache to bisect: if a clean run
also fails, it's not stale state.

Disabled RUSTC_WRAPPER in runner env (smithy commit 65e57a2);
runners restarted to pick up the new environment.
bpftrace running on host capturing every openat returning EACCES
with PID/UID/comm/filename. Pushing this empty commit to fire CI.

codecov Bot commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


avrabe added 5 commits May 3, 2026 11:32
The action bundles an older cargo-audit that can't parse CVSS 4.0
advisories like RUSTSEC-2026-0037 and exits non-zero on the parse
error before evaluating spar's Cargo.lock. cargo-audit is pre-
installed on smithy at v0.21.2 (toolchains role) which handles
CVSS 4.0 fine.

Same effect (audit blocks PRs on advisory hits) without the wrapper.
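A sketch of the direct invocation, assuming the audit job sits on the light runner class and that the label list follows the rust-cpu pattern:

```yaml
jobs:
  audit:
    runs-on: [self-hosted, linux, x64, light]   # assumed label list
    steps:
      - uses: actions/checkout@v4
      # Uses the cargo-audit pre-installed on the runner instead of the
      # action wrapper; exits non-zero when an advisory matches Cargo.lock
      - run: cargo audit
```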
Smithy main now ships:
  - subuid/subgid for runner1..8 (Cargo Deny rootless container fix)
  - CARGO_HOME/bin on the runner env PATH (Rivet validate fix)
  - always-on bpftrace EACCES tracing (smithy-trace-eacces.service)

Plus this branch carries:
  - cargo audit invoked directly (replaces broken rustsec/audit-check)

All runners restarted with new env. This commit fires fresh CI.

…roken)

Two adjustments after the smithy subuid + PATH fixes landed:

1. cargo-deny: drop EmbarkStudios/cargo-deny-action@v2 (which runs
   in a rootless container) in favour of direct `cargo deny check`.
   Smithy has cargo-deny installed (toolchains role v0.16.4). The
   container action fails on our hardened runner systemd unit:
   newuidmap is setuid but NoNewPrivileges=true blocks the
   escalation, so the rootless namespace can't be set up. Going
   direct sidesteps the entire interaction; we'd otherwise need to
   weaken the runner hardening for this single workflow.

2. audit: back to ubuntu-latest temporarily. Smithy ships cargo-audit
   v0.21.2 which still rejects RUSTSEC-2026-0037 ('unsupported CVSS
   version: 4.0') even though upstream rustsec 0.30+ supports CVSS
   4.0. v0.22.1 would fix it but that build trips on our
   sccache-on-cc setup (aws-lc-sys C compile through sccache fails).
   Move back once smithy ships an upgraded cargo-audit.
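Roughly, the deny job after this change (the light label list is an assumption; the job name is illustrative):

```yaml
jobs:
  deny:
    runs-on: [self-hosted, linux, x64, light]   # assumed label list
    steps:
      - uses: actions/checkout@v4
      # Was: uses: EmbarkStudios/cargo-deny-action@v2 (rootless container,
      # blocked by NoNewPrivileges=true on the runner unit)
      - run: cargo deny check
```

Running the binary directly needs no user namespace, so the runner hardening stays untouched.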
Surfaced when running `cargo deny check` directly with the
toolchains-role-installed cargo-deny v0.16.4 on smithy:

  error[deprecated]: this key has been removed, see
  EmbarkStudios/cargo-deny#611

The yanked + licenses + bans + sources sections still gate
normally. Unmaintained-crate detection moved out of the static
config in newer cargo-deny; revisit if/when we want to re-enable
that signal.

cargo-deny and cargo-audit share the same rustsec advisory parser.
Both fail at the same point on RUSTSEC-2026-0037 because the
embedded rustsec rejects CVSS 4.0 strings. The audit job (on
hosted) still covers vulnerability matching; cargo-deny here keeps
gating bans, licenses, and sources, which is what it actually adds
beyond audit. Drop the workaround once smithy ships an upgraded
rustsec parser (tracked alongside the cargo-audit upgrade).
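One way to express this workaround is to name the checks explicitly, so the advisories check (and with it the advisory-database parse) is skipped. This is a sketch; the branch may implement it differently.

```yaml
      # Run only the checks that do not load the advisory database;
      # vulnerability matching stays with the hosted audit job for now
      - run: cargo deny check bans licenses sources
```

Dropping the workaround later would mean going back to a plain `cargo deny check`, which runs all four checks including advisories.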